Abstract

Stochastic convex optimization is a basic and well-studied primitive in machine
learning. It is well known that convex and Lipschitz functions can be minimized
efficiently using Stochastic Gradient Descent (SGD).

The Normalized Gradient Descent (NGD) algorithm is an adaptation of Gradient
Descent that updates according to the direction of the gradients, rather than the
gradients themselves. In this paper we analyze a stochastic version of NGD and prove
its convergence to a global minimum for a wider class of functions: we require the
functions to be quasi-convex and locally-Lipschitz. Quasi-convexity broadens the
concept of unimodality to multiple dimensions and allows for certain types of saddle points,
which are a known hurdle for first-order optimization methods such as gradient descent.
Locally-Lipschitz functions are only required to be Lipschitz in a small region around
the optimum. This assumption circumvents gradient explosion, which is another known
hurdle for gradient descent variants.

Interestingly, unlike the vanilla SGD algorithm, the stochastic normalized gradient 
descent algorithm provably requires a minimal minibatch size. 


1 Introduction 

The benefits of using the Stochastic Gradient Descent (SGD) scheme for learning can hardly
be overstated. For convex and Lipschitz objectives, SGD is guaranteed to find an
$\epsilon$-optimal solution within $O(1/\epsilon^2)$ iterations and requires only an unbiased
estimator for the gradient, which is obtained with only one (or a few) data samples. However,
when applied to non-convex problems several drawbacks are revealed. In particular, SGD is
widely used for deep learning Bengio (2009), one of the most interesting fields where
stochastic non-convex optimization problems arise. Often, the objective in these kinds of
problems demonstrates two extreme phenomena: on the one hand plateaus, i.e., regions with
very small gradients; and on the other hand sharp cliffs, i.e., exceedingly high gradients.
As is expected, applying SGD to these problems is often reported to yield unsatisfactory
results.

In this paper we analyze a stochastic version of the Normalized Gradient Descent (NGD) 
algorithm, which we denote by SNGD. Each iteration of SNGD is as simple and efficient 
as SGD, but is much more appropriate for non-convex optimization problems, overcoming 
some of the pitfalls that SGD may encounter. Particularly, we define a family of locally-quasi- 
convex and locally-Lipschitz functions, and prove that SNGD is suitable for optimizing such 
objectives. 

Local-quasi-convexity is a generalization of unimodal functions to multiple dimensions, which
includes quasi-convex and convex functions as a subclass. Locally-quasi-convex functions
allow for certain types of plateaus and saddle points which are difficult for SGD and other
gradient descent variants. Local-Lipschitzness is a generalization of Lipschitz functions that
only assumes Lipschitzness in a small region around the minima, whereas farther away the
gradients may be unbounded. Gradient explosion is thus another difficulty that poses problems
for other stochastic gradient descent variants and is successfully tackled by SNGD.

Our contributions: 

• We introduce local-quasi-convexity, a property that extends quasi-convexity and
captures unimodal functions which are not quasi-convex. We prove that NGD finds an
$\epsilon$-optimal minimum for such functions within $O(1/\epsilon^2)$ iterations. As a
special case, we show that the above rate can be attained for quasi-convex functions that
are Lipschitz in an $\Omega(\epsilon)$-region around the optimum (gradients may be unbounded
outside this region). For objectives that are also smooth in an
$\Omega(\sqrt{\epsilon})$-region around the optimum, we prove a faster rate of $O(1/\epsilon)$.

• We introduce a new setup: stochastic optimization of locally-quasi-convex functions;
and show that this setup captures Generalized Linear Models (GLM) regression, McCullagh
and Nelder (1989). For this setup, we devise a stochastic version of NGD
(SNGD), and show that it converges within $O(1/\epsilon^2)$ iterations to an
$\epsilon$-optimal minimum.

• The above positive result requires that at each iteration of SNGD, the gradient should 
be estimated using a minibatch of a minimal size. We provide a negative result showing 
that if the minibatch size is too small then the algorithm might indeed diverge. 

• We report experimental results supporting our theoretical guarantees and demonstrate 
an accelerated convergence attained by SNGD. 

1.1 Related Work 

Quasi-convex optimization problems arise in numerous fields, spanning economics Varian
(1985); Laffont and Martimort (2009), industrial organization Wolfstetter (1999), and
computer vision Ke and Kanade (2007). It is well known that quasi-convex optimization tasks
can be solved by a series of convex feasibility problems Boyd and Vandenberghe (2004);
however, solving such feasibility problems may in general be very costly Goffin et al. (1996).

There exists a rich literature concerning quasi-convex optimization in the offline case,
Polyak (1967); Zabotin et al. (1972); Khabibullin (1977); Sikorski (1986). A pioneering
paper by Nesterov (1984) was the first to suggest an efficient algorithm, namely Normalized
Gradient Descent, and prove that this algorithm attains an $\epsilon$-optimal solution within
$O(1/\epsilon^2)$ iterations given a differentiable quasi-convex objective. This work was
later extended by Kiwiel (2001), showing that the same result may be achieved assuming upper
semi-continuous quasi-convex objectives. In Konnov (2003) it was shown how to attain faster
rates for quasi-convex optimization, but this requires knowing the optimal value of the
objective, an assumption that generally does not hold in practice.

Among the deep learning community there have been several attempts to tackle
gradient-explosion/plateaus. Ideas spanning gradient-clipping Pascanu et al. (2013), smart
initialization Doya (1993), and more Martens and Sutskever (2011), have been shown to improve
training in practice. Yet, none of these works provides a theoretical analysis showing better
convergence guarantees.

To the best of our knowledge, there are no previous results on stochastic versions of NGD,
nor results regarding locally-quasi-convex/locally-Lipschitz functions.

Gradient descent with fixed step sizes, including its stochastic variants, is known to
perform poorly when the gradients are too small in a plateau area of the function, or
alternatively when the other extreme happens: gradient explosions. These two phenomena have
been reported in certain types of non-convex optimization, such as training of deep networks.

Figure 1 depicts a one-dimensional family of functions for which GD behaves provably
poorly. With a large step-size, GD will hit the cliffs and then oscillate between the two
boundaries. Alternatively, with a small step size, the low gradients will cause GD to miss
the middle valley which has constant size of 1/2. On the other hand, this exact function is
quasi-convex and locally-Lipschitz, and hence the NGD algorithm provably converges to the
optimum quickly.

2 Definitions and Notations 

We use $\|\cdot\|$ to denote the Euclidean norm. $\mathbb{B}_d(\mathbf{x}, r)$ denotes the
$d$-dimensional Euclidean ball of radius $r$ centered at $\mathbf{x}$, and
$\mathbb{B}_d := \mathbb{B}_d(0, 1)$. $[N]$ denotes the set $\{1, \ldots, N\}$.

For simplicity, throughout the paper we always assume that functions are differentiable 
(but if not stated explicitly, we do not assume any bound on the norm of the gradients). 

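Definition 2.1. (Local-Lipschitzness and Local-Smoothness) Let $\mathbf{z} \in \mathbb{R}^d$, $G, \epsilon > 0$.

A function $f : \mathcal{K} \mapsto \mathbb{R}$ is called $(G, \epsilon, \mathbf{z})$-Locally-Lipschitz if for every
$\mathbf{x}, \mathbf{y} \in \mathbb{B}_d(\mathbf{z}, \epsilon)$, we have

$$|f(\mathbf{x}) - f(\mathbf{y})| \le G \|\mathbf{x} - \mathbf{y}\| .$$

Similarly, the function is $(\beta, \epsilon, \mathbf{z})$-locally-smooth if for every
$\mathbf{x}, \mathbf{y} \in \mathbb{B}_d(\mathbf{z}, \epsilon)$ we have,

$$|f(\mathbf{x}) - f(\mathbf{y}) - \langle \nabla f(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle| \le \frac{\beta}{2} \|\mathbf{x} - \mathbf{y}\|^2 .$$

Figure 1: A quasi-convex Locally-Lipschitz function with plateaus and cliffs.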

Next we define quasi-convex functions: 

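Definition 2.2. (Quasi-Convexity) We say that a function $f : \mathbb{R}^d \mapsto \mathbb{R}$ is quasi-convex
if $\forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^d$ such that $f(\mathbf{y}) \le f(\mathbf{x})$, it follows that

$$\langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle \le 0 .$$

We further say that $f$ is strictly-quasi-convex, if it is quasi-convex and its gradients vanish
only at the global minima, i.e., $\forall \mathbf{y} : f(\mathbf{y}) > \min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x}) \Rightarrow \|\nabla f(\mathbf{y})\| > 0$.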

Informally, the above characterization states that the (opposite) gradient of a quasi- 
convex function directs us in a global descent direction. Following is an equivalent (more 
common) definition: 

Definition 2.3. (Quasi-Convexity) We say that a function $f : \mathbb{R}^d \mapsto \mathbb{R}$ is quasi-convex
if any $\alpha$-sublevel-set of $f$ is convex, i.e., $\forall \alpha \in \mathbb{R}$ the set

$$\mathcal{L}_\alpha(f) = \{\mathbf{x} : f(\mathbf{x}) \le \alpha\}$$

is convex.

The equivalence between the above definitions can be found in Boyd and Vandenberghe
(2004); for completeness we provide a proof in Appendix A. Throughout this paper we denote
the sublevel-set of $f$ at $\mathbf{x}$ by

$$S_f(\mathbf{x}) = \{\mathbf{y} : f(\mathbf{y}) \le f(\mathbf{x})\} . \qquad (1)$$


3 Local-Quasi-Convexity 


Quasi-convexity does not fully capture the notion of unimodality in several dimensions. As
an example let $\mathbf{x} = (x_1, x_2) \in [-10, 10]^2$, and consider the function

$$g(\mathbf{x}) = (1 + e^{-x_1})^{-1} + (1 + e^{-x_2})^{-1} . \qquad (2)$$

It is natural to consider $g$ as unimodal since it has no local minima other than the unique
global minimum at $\mathbf{x}^* = (-10, -10)$. However, $g$ is not quasi-convex: consider the points
$\mathbf{x} = (\log 16, -\log 4)$, $\mathbf{y} = (-\log 4, \log 16)$, which belong to the 1.2-sublevel set; their
average does not belong to the same sublevel set since $g(\mathbf{x}/2 + \mathbf{y}/2) = 4/3$.
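The counterexample can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `sigmoid` and `g` are illustrative choices, not notation from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def g(x):
    """g(x) = sigmoid(x1) + sigmoid(x2), the function of Equation (2)."""
    return sigmoid(x[0]) + sigmoid(x[1])

x = (math.log(16), -math.log(4))
y = (-math.log(4), math.log(16))
mid = tuple(0.5 * (a + b) for a, b in zip(x, y))   # equals (log 2, log 2)

print(g(x), g(y))   # both equal 16/17 + 1/5, inside the 1.2-sublevel set
print(g(mid))       # 2/3 + 2/3 = 4/3, outside the 1.2-sublevel set
```

Both endpoints lie in the 1.2-sublevel set while their midpoint does not, so that sublevel set is not convex.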

Quasi-convex functions always enable us to explore, meaning that the gradient always
directs us in a global descent direction. Intuitively, from an optimization point of view, we only
need such a direction whenever we do not exploit, i.e., whenever we are not approximately
optimal.

In what follows we define local-quasi-convexity, a property that enables us to either
explore or exploit. This property¹ captures a wider class of unimodal functions (such as $g$
above) than mere quasi-convexity. Later we justify this definition by showing that it
captures Generalized Linear Models (GLM) regression, see McCullagh and Nelder (1989);
Kalai and Sastry (2009).

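Definition 3.1. (Local-Quasi-Convexity) Let $\mathbf{x}, \mathbf{z} \in \mathbb{R}^d$, $\kappa, \epsilon > 0$. We say that
$f : \mathbb{R}^d \mapsto \mathbb{R}$ is $(\epsilon, \kappa, \mathbf{z})$-Strictly-Locally-Quasi-Convex (SLQC) in $\mathbf{x}$, if at least
one of the following applies:

1. $f(\mathbf{x}) - f(\mathbf{z}) \le \epsilon$.

2. $\|\nabla f(\mathbf{x})\| > 0$, and for every $\mathbf{y} \in \mathbb{B}(\mathbf{z}, \epsilon/\kappa)$ it holds that $\langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle \le 0$.

Note that if $f$ is a $G$-Lipschitz and strictly-quasi-convex function, then $\forall \mathbf{x}, \mathbf{z} \in \mathbb{R}^d$, $\forall \epsilon > 0$,
it holds that $f$ is $(\epsilon, G, \mathbf{z})$-SLQC in $\mathbf{x}$. Recalling the function $g$ that appears in Equation (2),
it can be shown that $\forall \epsilon \in (0, 1]$, $\forall \mathbf{x} \in [-10, 10]^2$, this function is $(\epsilon, 1, \mathbf{x}^*)$-SLQC in
$\mathbf{x}$, where $\mathbf{x}^* = (-10, -10)$ (see Appendix B).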
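The two conditions of Definition 3.1 can be probed numerically at a given point. The sketch below is a sampling heuristic for two-dimensional functions, not a proof: it draws points y from the ball around z and looks for a violation of the inner-product condition. The function name `looks_slqc_2d`, the sample count, and the seed are assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def g(x):
    # The function of Equation (2).
    return sigmoid(x[0]) + sigmoid(x[1])

def grad_g(x):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    return tuple(sigmoid(xi) * (1.0 - sigmoid(xi)) for xi in x)

def looks_slqc_2d(f, grad, x, z, eps, kappa, n_samples=2000, seed=0):
    """Heuristic check of Definition 3.1 at x for 2-D functions (sampling, not a proof)."""
    if f(x) - f(z) <= eps:
        return True                       # condition 1: x is already eps-optimal w.r.t. z
    gx = grad(x)
    if math.hypot(gx[0], gx[1]) == 0.0:
        return False                      # condition 2 needs a non-vanishing gradient
    rng = random.Random(seed)
    radius = eps / kappa
    for _ in range(n_samples):
        # draw y uniformly from the disc B(z, eps / kappa)
        r = radius * math.sqrt(rng.random())
        theta = 2.0 * math.pi * rng.random()
        y = (z[0] + r * math.cos(theta), z[1] + r * math.sin(theta))
        if gx[0] * (y[0] - x[0]) + gx[1] * (y[1] - x[1]) > 0.0:
            return False                  # found y with <grad f(x), y - x> > 0
    return True

x_star = (-10.0, -10.0)
x = (math.log(16), -math.log(4))
print(looks_slqc_2d(g, grad_g, x, x_star, eps=0.5, kappa=1.0))   # True
```

A `True` result only means no violating y was sampled; a `False` result from the second condition is a genuine counterexample at that point.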


3.1 Generalized Linear Models (GLM) 

3.1.1 The Idealized GLM 


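In this setup we have a collection of $m$ samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^m \in \mathbb{B}_d \times [0, 1]$, and an activation
function $\phi : \mathbb{R} \mapsto \mathbb{R}$. We are guaranteed to have $\mathbf{w}^* \in \mathbb{R}^d$ such that: $y_i = \phi\langle \mathbf{w}^*, \mathbf{x}_i \rangle$, $\forall i \in [m]$
(we denote $\phi\langle \mathbf{w}, \mathbf{x} \rangle := \phi(\langle \mathbf{w}, \mathbf{x} \rangle)$). The performance of a predictor $\mathbf{w} \in \mathbb{R}^d$ is measured by
the average square error over all samples:

$$\mathrm{err}_m(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^m \left( y_i - \phi\langle \mathbf{w}, \mathbf{x}_i \rangle \right)^2 . \qquad (3)$$

In Kalai and Sastry (2009) it is shown that the Perceptron problem with $\gamma$-margin is a
special case of GLM regression.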
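The empirical error of Equation (3) is straightforward to compute. The sketch below assumes the sigmoid activation and uses made-up sample points and a made-up `w_star`; `glm_error` is an illustrative name, not one used in the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def glm_error(w, samples, phi=sigmoid):
    """Average squared error over the samples, as in Equation (3)."""
    total = 0.0
    for x, y in samples:
        pred = phi(sum(wi * xi for wi, xi in zip(w, x)))
        total += (y - pred) ** 2
    return total / len(samples)

# Idealized GLM: labels are generated exactly by a hidden w_star (hypothetical values).
w_star = (1.0, -2.0)
xs = [(0.6, 0.0), (0.0, 0.8), (0.3, -0.4), (-0.5, 0.5)]   # points inside the unit ball
samples = [(x, sigmoid(sum(a * b for a, b in zip(w_star, x)))) for x in xs]

print(glm_error(w_star, samples))        # zero error: w_star reproduces every label
print(glm_error((0.0, 0.0), samples))    # positive: the zero predictor outputs 0.5 everywhere
```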

The sigmoid function $\phi(z) = (1 + e^{-z})^{-1}$ is a popular activation function in the field
of deep learning. The next lemma states that in the idealized GLM problem with sigmoid
activation, the error function is SLQC (but not quasi-convex). As we will see in Section
4 this implies that Algorithm 1 finds an $\epsilon$-optimal minimum of $\mathrm{err}_m(\mathbf{w})$ within $\mathrm{poly}(1/\epsilon)$
iterations.

¹Definition 3.1 can be generalized in a manner that captures a broader range of scenarios (e.g. the
Perceptron problem); we defer this definition to Appendix H.

Lemma 3.1. Consider the idealized GLM problem with the sigmoid activation, and assume
that $\|\mathbf{w}^*\| \le W$. Then the error function appearing in Equation (3) is $(\epsilon, e^W, \mathbf{w}^*)$-SLQC in
$\mathbf{w}$, $\forall \epsilon > 0$, $\forall \mathbf{w} \in \mathbb{B}_d(0, W)$ (but it is not generally quasi-convex).

We defer the proof to Appendix C.

3.1.2 The Noisy GLM 

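In the noisy GLM setup (see McCullagh and Nelder (1989); Kalai and Sastry (2009)), we
may draw i.i.d. samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^m \in \mathbb{B}_d \times [0, 1]$ from an unknown distribution $\mathcal{D}$. We
assume that there exists a predictor $\mathbf{w}^* \in \mathbb{R}^d$ such that $\mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}[y | \mathbf{x}] = \phi\langle \mathbf{w}^*, \mathbf{x} \rangle$, where $\phi$ is
an activation function. Given $\mathbf{w} \in \mathbb{R}^d$ we define its expected error as follows:

$$\mathcal{E}(\mathbf{w}) = \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} \left( y - \phi\langle \mathbf{w}, \mathbf{x} \rangle \right)^2 ,$$

and it can be shown that $\mathbf{w}^*$ is a global minimum of $\mathcal{E}$. We are interested in schemes that
obtain an $\epsilon$-optimal minimum of $\mathcal{E}$ within $\mathrm{poly}(1/\epsilon)$ samples and optimization steps. Given
$m$ samples from $\mathcal{D}$, their empirical error $\mathrm{err}_m(\mathbf{w})$ is defined as in Equation (3).

The following lemma states that in this setup, letting $m = \Omega(1/\epsilon^2)$, $\mathrm{err}_m$ is SLQC
with high probability. This property will enable us to apply Algorithm 2 to obtain an
$\epsilon$-optimal minimum of $\mathcal{E}$ within $\mathrm{poly}(1/\epsilon)$ samples from $\mathcal{D}$, and $\mathrm{poly}(1/\epsilon)$ optimization steps.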

Lemma 3.2. Let $\delta, \epsilon \in (0, 1)$. Consider the noisy GLM problem with the sigmoid activation,
and assume that $\|\mathbf{w}^*\| \le W$. Given a fixed point $\mathbf{w} \in \mathbb{B}(0, W)$, then w.p. $\ge 1 - \delta$, after
$m = \Omega(\log(1/\delta)/\epsilon^2)$ samples, the empirical error function appearing in Equation (3) is
$(\epsilon, e^W, \mathbf{w}^*)$-SLQC in $\mathbf{w}$.

Note that if we had required the SLQC to hold $\forall \mathbf{w} \in \mathbb{B}(0, W)$, then we would need the
number of samples to depend on the dimension, $d$, which we would like to avoid. Instead, we
require SLQC to hold for a fixed $\mathbf{w}$. This satisfies the conditions of Algorithm 2, enabling us
to find an $\epsilon$-optimal solution with a sample complexity that is independent of the dimension.
We defer the proof of Lemma 3.2 to Appendix D.

4 NGD for Locally-Quasi-Convex Optimization 

Here we present the NGD algorithm, and prove the convergence rate of this algorithm
for SLQC objectives. Our analysis is simple, enabling us to extend the convergence rate
presented in Nesterov (1984) beyond quasi-convex functions. We then show that quasi-convex
and locally-Lipschitz objectives are SLQC, implying that NGD converges even if the
gradients are unbounded outside a small region around the minima. For quasi-convex and
locally-smooth objectives, we show that NGD attains a faster convergence rate.
Algorithm 1 Normalized Gradient Descent (NGD)

Input: #Iterations $T$, $\mathbf{x}_1 \in \mathbb{R}^d$, learning rate $\eta$

for $t = 1 \ldots T$ do

    Update: $\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \hat{g}_t$ where $g_t = \nabla f(\mathbf{x}_t)$, $\hat{g}_t = g_t / \|g_t\|$

end for

Return: $\bar{\mathbf{x}}_T = \arg\min_{\{\mathbf{x}_1, \ldots, \mathbf{x}_T\}} f(\mathbf{x}_t)$



NGD is presented in Algorithm 1. NGD is similar to GD, except we normalize the 
gradients. It is intuitively clear that to obtain robustness to plateaus (where the gradient 
can be arbitrarily small) and to exploding gradients (where the gradient can be arbitrarily 
large), one must ignore the size of the gradient. It is more surprising that the information 
in the direction of the gradient suffices to guarantee convergence. 
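The update rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the test function `f` (a radially increasing, hence quasi-convex, bump with a very flat plateau), the starting point, and the step size are all made up, and `descend` is a hypothetical helper that runs either NGD or plain GD.

```python
import math

def descend(f, grad, x1, eta, T, normalize=True):
    """Run T steps from x1 and return the best of x_1..x_T.

    normalize=True follows the NGD update (step along g / ||g||);
    normalize=False is plain gradient descent, kept for comparison.
    """
    x = list(x1)
    best_x, best_val = list(x), f(x)
    for _ in range(T):
        val = f(x)
        if val < best_val:
            best_x, best_val = list(x), val
        g = grad(x)
        norm = math.sqrt(sum(gi * gi for gi in g))
        if norm == 0.0:
            break                      # the normalized direction is undefined here
        scale = 1.0 / norm if normalize else 1.0
        x = [xi - eta * scale * gi for xi, gi in zip(x, g)]
    return best_x

def f(x):
    # Quasi-convex: sublevel sets are balls. Nearly flat far from the origin.
    return 1.0 - math.exp(-0.5 * sum(xi * xi for xi in x))

def grad_f(x):
    s = math.exp(-0.5 * sum(xi * xi for xi in x))
    return [xi * s for xi in x]

x1 = [6.0, 8.0]                        # on the plateau, at distance 10 from the optimum
x_ngd = descend(f, grad_f, x1, eta=0.05, T=300, normalize=True)
x_gd = descend(f, grad_f, x1, eta=0.05, T=300, normalize=False)
print(f(x_ngd), f(x_gd))               # NGD gets close to 0; GD stays stuck near 1
```

On this example the gradient at the start has norm of order 1e-21, so plain GD with the same step size barely moves, while NGD crosses the plateau in fixed-length steps.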

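Following is the main theorem of this section:

Theorem 4.1. Fix $\epsilon > 0$, let $f : \mathbb{R}^d \mapsto \mathbb{R}$, and $\mathbf{x}^* \in \arg\min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x})$. Given that $f$ is
$(\epsilon, \kappa, \mathbf{x}^*)$-SLQC in every $\mathbf{x} \in \mathbb{R}^d$. Then running the NGD algorithm with $T \ge \kappa^2 \|\mathbf{x}_1 - \mathbf{x}^*\|^2 / \epsilon^2$,
and $\eta = \epsilon/\kappa$, we have that: $f(\bar{\mathbf{x}}_T) - f(\mathbf{x}^*) \le \epsilon$.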

Theorem 4.1 states that $(\cdot, \cdot, \mathbf{x}^*)$-SLQC functions admit a $\mathrm{poly}(1/\epsilon)$ convergence rate using
NGD. The intuition behind this lies in Definition 3.1, which asserts that at a point $\mathbf{x}$ either the
(opposite) gradient points out a global optimization direction, or we are already $\epsilon$-optimal.
Note that the requirement of $(\epsilon, \cdot, \cdot)$-SLQC in any $\mathbf{x}$ is not restrictive: as we have seen in
Section 3, there are interesting examples of functions that admit this property $\forall \epsilon \in [0, 1]$,
and for any $\mathbf{x}$.

For simplicity we have presented NGD for unconstrained problems. Using projections
we can easily extend the algorithm and its analysis to constrained optimization over convex
sets. This makes it possible to achieve convergence of $O(1/\epsilon^2)$ for the objective presented in
Equation (2), and the idealized GLM problem presented in Section 3.1.1.

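We are now ready to prove Theorem 4.1:

Proof of Theorem 4.1. First note that if the gradient of $f$ vanishes at $\mathbf{x}_t$, then by the SLQC
assumption we must have that $f(\mathbf{x}_t) - f(\mathbf{x}^*) \le \epsilon$. Assume next that we perform $T$ iterations
and the gradient of $f$ at $\mathbf{x}_t$ never vanishes in these iterations. Consider the update rule of
NGD (Algorithm 1), then by standard algebra we get,

$$\|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 = \|\mathbf{x}_t - \mathbf{x}^*\|^2 - 2\eta \langle \hat{g}_t, \mathbf{x}_t - \mathbf{x}^* \rangle + \eta^2 .$$

Assume that $\forall t \in [T]$ we have $f(\mathbf{x}_t) - f(\mathbf{x}^*) > \epsilon$. Take $\mathbf{y} = \mathbf{x}^* + (\epsilon/\kappa) \hat{g}_t$, and observe that
$\|\mathbf{y} - \mathbf{x}^*\| \le \epsilon/\kappa$. The $(\epsilon, \kappa, \mathbf{x}^*)$-SLQC assumption implies that $\langle \hat{g}_t, \mathbf{y} - \mathbf{x}_t \rangle \le 0$, and therefore

$$\langle \hat{g}_t, \mathbf{x}^* + (\epsilon/\kappa) \hat{g}_t - \mathbf{x}_t \rangle \le 0 \;\Rightarrow\; \langle \hat{g}_t, \mathbf{x}_t - \mathbf{x}^* \rangle \ge \epsilon/\kappa .$$

Setting $\eta = \epsilon/\kappa$, the above implies,

$$\|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 \le \|\mathbf{x}_t - \mathbf{x}^*\|^2 - 2\eta\epsilon/\kappa + \eta^2 = \|\mathbf{x}_t - \mathbf{x}^*\|^2 - \epsilon^2/\kappa^2 .$$

Thus, after $T$ iterations for which $f(\mathbf{x}_t) - f(\mathbf{x}^*) > \epsilon$ we get

$$0 \le \|\mathbf{x}_{T+1} - \mathbf{x}^*\|^2 \le \|\mathbf{x}_1 - \mathbf{x}^*\|^2 - T\epsilon^2/\kappa^2 .$$

Therefore, we must have $T \le \kappa^2 \|\mathbf{x}_1 - \mathbf{x}^*\|^2 / \epsilon^2$. □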


4.1 Locally-Lipschitz/Smooth Quasi-Convex Optimization 

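It can be shown that strict-quasi-convexity and $(G, \epsilon/G, \mathbf{x}^*)$-local-Lipschitzness of $f$ imply
that $f$ is $(\epsilon, G, \mathbf{x}^*)$-SLQC $\forall \mathbf{x} \in \mathbb{R}^d$, $\forall \epsilon > 0$, and $\mathbf{x}^* \in \arg\min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x})$ (see Appendix E).
Therefore the following is a direct corollary of Theorem 4.1:

Corollary 4.1. Fix $\epsilon > 0$, let $f : \mathbb{R}^d \mapsto \mathbb{R}$, and $\mathbf{x}^* \in \arg\min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x})$. Given that $f$ is
strictly quasi-convex and $(G, \epsilon/G, \mathbf{x}^*)$-locally-Lipschitz. Then running the NGD algorithm
with $T \ge G^2 \|\mathbf{x}_1 - \mathbf{x}^*\|^2 / \epsilon^2$, and $\eta = \epsilon/G$, we have that: $f(\bar{\mathbf{x}}_T) - f(\mathbf{x}^*) \le \epsilon$.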

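In case $f$ is also locally-smooth, we state an even faster rate:

Theorem 4.2. Fix $\epsilon > 0$, let $f : \mathbb{R}^d \mapsto \mathbb{R}$, and $\mathbf{x}^* \in \arg\min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x})$. Given that $f$ is
strictly quasi-convex and $(\beta, \sqrt{2\epsilon/\beta}, \mathbf{x}^*)$-locally-smooth. Then running the NGD algorithm
with $T \ge \beta \|\mathbf{x}_1 - \mathbf{x}^*\|^2 / 2\epsilon$, and $\eta = \sqrt{2\epsilon/\beta}$, we have that: $f(\bar{\mathbf{x}}_T) - f(\mathbf{x}^*) \le \epsilon$.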

We prove Theorem 4.2 in Appendix F. 

Remark 1. The above corollary (resp. theorem) implies that $f$ could have arbitrarily large
gradients and second derivatives outside $\mathbb{B}(\mathbf{x}^*, \epsilon/G)$ (resp. $\mathbb{B}(\mathbf{x}^*, \sqrt{2\epsilon/\beta})$), yet NGD is still
ensured to output an $\epsilon$-optimal point within $G^2 \|\mathbf{x}_1 - \mathbf{x}^*\|^2 / \epsilon^2$ (resp. $\beta \|\mathbf{x}_1 - \mathbf{x}^*\|^2 / 2\epsilon$)
iterations. We are not familiar with a similar guarantee for GD even in the convex case.
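For concreteness, the iteration counts and step sizes stated in Corollary 4.1 and Theorem 4.2 can be computed as follows. The function names are illustrative, and `dist` stands for the initial distance to the optimum, which is usually unknown in practice and would have to be replaced by an upper bound.

```python
import math

def ngd_params_lipschitz(G, dist, eps):
    """(T, eta) with T >= G^2 * dist^2 / eps^2 and eta = eps / G (Corollary 4.1)."""
    return math.ceil((G * dist / eps) ** 2), eps / G

def ngd_params_smooth(beta, dist, eps):
    """(T, eta) with T >= beta * dist^2 / (2 * eps) and eta = sqrt(2 * eps / beta) (Theorem 4.2)."""
    return math.ceil(beta * dist ** 2 / (2.0 * eps)), math.sqrt(2.0 * eps / beta)

print(ngd_params_lipschitz(G=1.0, dist=5.0, eps=0.5))    # (100, 0.5)
print(ngd_params_smooth(beta=2.0, dist=4.0, eps=0.25))   # (64, 0.5)
```

The first count grows like 1/eps^2 and the second like 1/eps, which is the gap between the two rates.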

5 SNGD for Stochastic SLQC Optimization 

Here we describe the setting of stochastic SLQC optimization. Then we describe our SNGD
algorithm, which is ensured to yield an $\epsilon$-optimal solution within $\mathrm{poly}(1/\epsilon)$ queries. We also
show that the (noisy) GLM problem, described in Section 3.1.2, is an instance of stochastic
SLQC optimization, allowing us to provably solve this problem within $\mathrm{poly}(1/\epsilon)$ samples and
optimization steps using SNGD.





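Algorithm 2 Stochastic Normalized Gradient Descent (SNGD)

Input: #Iterations $T$, $\mathbf{x}_1 \in \mathbb{R}^d$, learning rate $\eta$, minibatch size $b$

for $t = 1 \ldots T$ do

    Sample: $\{\psi_i\}_{i=1}^b \sim \mathcal{D}^b$, and define $f_t(\mathbf{x}) = \frac{1}{b} \sum_{i=1}^b \psi_i(\mathbf{x})$

    Update: $\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \hat{g}_t$ where $g_t = \nabla f_t(\mathbf{x}_t)$, $\hat{g}_t = g_t / \|g_t\|$

end for

Return: $\bar{\mathbf{x}}_T = \arg\min_{\{\mathbf{x}_1, \ldots, \mathbf{x}_T\}} f_t(\mathbf{x}_t)$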


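The stochastic SLQC optimization setup: Consider the problem of minimizing a
function $f : \mathbb{R}^d \mapsto \mathbb{R}$, and assume there exists a distribution over functions $\mathcal{D}$, such that:

$$f(\mathbf{x}) := \mathbb{E}_{\psi \sim \mathcal{D}}[\psi(\mathbf{x})] .$$

We assume that we may access $f$ by randomly sampling minibatches of size $b$, and querying
the gradients of these minibatches. Thus, upon querying a point $\mathbf{x}_t \in \mathbb{R}^d$, a random
minibatch $\{\psi_i\}_{i=1}^b \sim \mathcal{D}^b$ is sampled, and we receive $\nabla f_t(\mathbf{x}_t)$, where $f_t(\mathbf{x}) = \frac{1}{b} \sum_{i=1}^b \psi_i(\mathbf{x})$. We
make the following assumption regarding the minibatch averages:

Assumption 5.1. Let $T, \epsilon, \delta > 0$, $\mathbf{x}^* \in \arg\min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x})$. There exists $\kappa > 0$, and a function
$b_0 : \mathbb{R}^3 \mapsto \mathbb{R}$, such that for $b \ge b_0(\epsilon, \delta, T)$ then w.p. $\ge 1 - \delta$ and $\forall t \in [T]$, the minibatch average
$f_t(\mathbf{x}) = \frac{1}{b} \sum_{i=1}^b \psi_i(\mathbf{x})$ is $(\epsilon, \kappa, \mathbf{x}^*)$-SLQC in $\mathbf{x}_t$. Moreover, we assume $|f_t(\mathbf{x})| \le M$, $\forall t \in [T]$,
$\mathbf{x} \in \mathbb{R}^d$.

Note that we assume that $b_0 = \mathrm{poly}(1/\epsilon, \log(T/\delta))$.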

Justification of Assumption 5.1. Noisy GLM regression (see Section 3.1.2) is an
interesting instance of a stochastic optimization problem where Assumption 5.1 holds. Indeed
according to Lemma 3.2, given $\epsilon, \delta, T > 0$, then for $b \ge \Omega(\log(T/\delta)/\epsilon^2)$ samples², the average
minibatch function is $(\epsilon, \kappa, \mathbf{x}^*)$-SLQC in $\mathbf{x}_t$, $\forall t \in [T]$, w.p. $\ge 1 - \delta$.

Local-quasi-convexity of minibatch averages is a plausible assumption when we optimize
an expected sum of quasi-convex functions that share common global minima (or when the
different global minima are close by). As seen from the examples presented in Equation (2),
and in Sections 3.1.1, 3.1.2, this sum is generally not quasi-convex, but is more often
locally-quasi-convex.

²In fact, Lemma 3.2 states that for $b = \Omega(\log(1/\delta)/\epsilon^2)$, the error function is SLQC in a single
decision point. Using the union bound we can show that for $b = \Omega(\log(T/\delta)/\epsilon^2)$ it holds for $T$ decision
points.

Note that in the general case when the objective is a sum of quasi-convex functions,
the number of local minima of such an objective may grow exponentially with the dimension
$d$, see Auer et al. (1996). This might imply that a general setup where each $\psi \sim \mathcal{D}$ is
quasi-convex may be generally hard.

5.1 Main Results 

SNGD is presented in Algorithm 2. SNGD is similar to SGD, except we normalize the
gradients. The normalization is crucial in order to take advantage of the SLQC assumption,
and in order to overcome the hurdles of plateaus and cliffs. Following is our main theorem:

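Theorem 5.1. Fix $\delta, \epsilon, G, M, \kappa > 0$. Suppose we run SNGD with $T \ge \kappa^2 \|\mathbf{x}_1 - \mathbf{x}^*\|^2 / \epsilon^2$
iterations, $\eta = \epsilon/\kappa$, and $b \ge \max\left\{ \frac{M^2 \log(4T/\delta)}{2\epsilon^2},\ b_0(\epsilon, \delta, T) \right\}$. Assume that for $b \ge b_0(\epsilon, \delta, T)$
then w.p. $\ge 1 - \delta$ and $\forall t \in [T]$, the function $f_t$ defined in the algorithm is $M$-bounded,
and is also $(\epsilon, \kappa, \mathbf{x}^*)$-SLQC in $\mathbf{x}_t$. Then, with probability of at least $1 - 2\delta$, we have that
$f(\bar{\mathbf{x}}_T) - f(\mathbf{x}^*) \le 3\epsilon$.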

We prove Theorem 5.1 at the end of this section.
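The stochastic procedure of Algorithm 2 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: `sample_psi` is a hypothetical callback returning a (value, gradient) pair of functions for one sampled loss, and the toy distribution (quadratics centered at 0.5 or 1.5, so the expected loss is minimized at 1.0) is made up for the example.

```python
import math
import random

def sngd(sample_psi, x1, eta, T, b, rng):
    """SNGD sketch: normalized step on a fresh minibatch of b losses each round.

    Returns the iterate x_t with the smallest minibatch value f_t(x_t).
    """
    x = list(x1)
    best_x, best_val = list(x), math.inf
    for _ in range(T):
        batch = [sample_psi(rng) for _ in range(b)]
        val = sum(v(x) for v, _ in batch) / b            # f_t(x_t)
        if val < best_val:
            best_x, best_val = list(x), val
        grads = [g(x) for _, g in batch]
        g_avg = [sum(col) / b for col in zip(*grads)]    # gradient of f_t at x_t
        norm = math.sqrt(sum(gi * gi for gi in g_avg))
        if norm == 0.0:
            continue                                     # direction undefined; resample
        x = [xi - eta * gi / norm for xi, gi in zip(x, g_avg)]
    return best_x

def sample_quadratic(rng):
    # psi_a(x) = 0.5 * (x - a)^2 with a uniform on {0.5, 1.5}
    a = rng.choice((0.5, 1.5))
    return (lambda x: 0.5 * (x[0] - a) ** 2, lambda x: [x[0] - a])

x_hat = sngd(sample_quadratic, [5.0], eta=0.05, T=300, b=64, rng=random.Random(0))
print(x_hat)   # a point near the minimizer 1.0 of the expected loss
```

Note that the returned point is selected using the noisy minibatch values, exactly as in the Return step, which is why the analysis needs the minibatch to be large enough for those values to be trustworthy.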

Remark 2. Since strict-quasi-convexity and $(G, \epsilon/G, \mathbf{x}^*)$-local-Lipschitzness imply
SLQC (App. E), the theorem implies that $f$ could have arbitrarily large gradients outside
$\mathbb{B}(\mathbf{x}^*, \epsilon/G)$, yet SNGD is still ensured to output an $\epsilon$-optimal point within $G^2 \|\mathbf{x}_1 - \mathbf{x}^*\|^2 / \epsilon^2$
iterations. We are not familiar with a similar guarantee for SGD even in the convex case.

Remark 3. Theorem 5.1 requires the minibatch size to be $\Omega(1/\epsilon^2)$. In the context of learning,
the number of functions, $n$, corresponds to the number of training examples. By standard
sample complexity bounds, $n$ should also be of order $1/\epsilon^2$. Therefore, one may wonder whether the
size of the minibatch should be of order $n$. This is not true, since the required training set
size is $1/\epsilon^2$ times the VC dimension of the hypothesis class. In many practical cases, the
VC dimension is more significant than $1/\epsilon^2$, and therefore $n$ will be much larger than the
required minibatch size. The reason our analysis requires a minibatch of size $1/\epsilon^2$, without
the VC dimension factor, is because we are just “validating” and not “learning”.

In SGD and for the case of convex functions, even a minibatch of size 1 suffices for
guaranteed convergence. In contrast, for SNGD we require a minibatch of size $1/\epsilon^2$. The
theorem below shows that the requirement for a large minibatch is not an artifact of our
analysis but is truly required.

Theorem 5.2. Let $\epsilon \in (0, 0.1]$. There exists a distribution over convex loss functions, such
that running SNGD with minibatch size of $b = 0.2/\epsilon$, with a high probability, it never reaches
an $\epsilon$-optimal solution.

We prove Theorem 5.2 in Section 5.2.3. The gap between the upper bound of $1/\epsilon^2$ and
the lower bound of $1/\epsilon$ remains an open question.

We now provide a sketch for the proof of Theorem 5.1:

Proof of Theorem 5.1. Theorem 5.1 is a consequence of the following two lemmas. In the first
we show that whenever all $f_t$'s are SLQC, there exists some $t$ such that $f_t(\mathbf{x}_t) - f_t(\mathbf{x}^*) \le \epsilon$.
In the second lemma, we show that for a large enough minibatch size $b$, then for any $t \in [T]$
we have $f(\mathbf{x}_t) \le f_t(\mathbf{x}_t) + \epsilon$, and $f(\mathbf{x}^*) \ge f_t(\mathbf{x}^*) - \epsilon$. Combining these two lemmas we conclude
that $f(\bar{\mathbf{x}}_T) - f(\mathbf{x}^*) \le 3\epsilon$.

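Lemma 5.1. Let $\epsilon, \delta > 0$. Suppose we run SNGD for $T \ge \kappa^2 \|\mathbf{x}_1 - \mathbf{x}^*\|^2 / \epsilon^2$ iterations,
$b \ge b_0(\epsilon, \delta, T)$, and $\eta = \epsilon/\kappa$. Assume that w.p. $\ge 1 - \delta$ all $f_t$'s are $(\epsilon, \kappa, \mathbf{x}^*)$-SLQC in
$\mathbf{x}_t$, whenever $b \ge b_0(\epsilon, \delta, T)$. Then w.p. $\ge 1 - \delta$ we must have some $t \in [T]$ for which
$f_t(\mathbf{x}_t) - f_t(\mathbf{x}^*) \le \epsilon$.

Lemma 5.1 is proved similarly to Theorem 4.1; we defer the proof to Section 5.2.1.

The second lemma relates $f_t(\mathbf{x}_t) - f_t(\mathbf{x}^*) \le \epsilon$ to a bound on $f(\mathbf{x}_t) - f(\mathbf{x}^*)$.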

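Lemma 5.2. Suppose $b \ge \frac{M^2 \log(4T/\delta)}{2\epsilon^2}$; then w.p. $\ge 1 - \delta$ and for every $t \in [T]$:

$$f(\mathbf{x}_t) \le f_t(\mathbf{x}_t) + \epsilon , \quad \text{and also,} \quad f(\mathbf{x}^*) \ge f_t(\mathbf{x}^*) - \epsilon .$$

Lemma 5.2 is a direct consequence of Hoeffding’s bound (see Section 5.2.2). Using the
definition of $\bar{\mathbf{x}}_T$ (Alg. 2), together with Lemma 5.2 gives:

$$f(\bar{\mathbf{x}}_T) - f(\mathbf{x}^*) \le f_t(\mathbf{x}_t) - f_t(\mathbf{x}^*) + 2\epsilon , \quad \forall t \in [T] .$$

Combining the latter with Lemma 5.1 establishes Theorem 5.1. □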

5.2 Remaining Proofs 

5.2.1 Proof of Lemma 5.1 

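Proof. First note that if the gradient of $f_t$ vanishes at $\mathbf{x}_t$, then by the SLQC assumption
we must have that $f_t(\mathbf{x}_t) - f_t(\mathbf{x}^*) \le \epsilon$. Assume next that we perform $T$ iterations and the
gradient of $f_t$ at $\mathbf{x}_t$ never vanishes in these iterations. Consider the update rule of SNGD
(Algorithm 2), then by standard algebra we get:

$$\|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 = \|\mathbf{x}_t - \mathbf{x}^*\|^2 - 2\eta \langle \hat{g}_t, \mathbf{x}_t - \mathbf{x}^* \rangle + \eta^2 .$$

Assume that $\forall t \in [T]$ we have $f_t(\mathbf{x}_t) - f_t(\mathbf{x}^*) > \epsilon$. Take $\mathbf{y} = \mathbf{x}^* + (\epsilon/\kappa) \hat{g}_t$, and observe that
$\|\mathbf{y} - \mathbf{x}^*\| \le \epsilon/\kappa$. Hence the $(\epsilon, \kappa, \mathbf{x}^*)$-SLQC assumption implies that $\langle \hat{g}_t, \mathbf{y} - \mathbf{x}_t \rangle \le 0$, thus,

$$\langle \hat{g}_t, \mathbf{x}^* + (\epsilon/\kappa) \hat{g}_t - \mathbf{x}_t \rangle \le 0 \;\Rightarrow\; \langle \hat{g}_t, \mathbf{x}_t - \mathbf{x}^* \rangle \ge \epsilon/\kappa .$$

This implies that, if we set $\eta = \epsilon/\kappa$ then

$$\|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 \le \|\mathbf{x}_t - \mathbf{x}^*\|^2 - 2\eta\epsilon/\kappa + \eta^2 = \|\mathbf{x}_t - \mathbf{x}^*\|^2 - \epsilon^2/\kappa^2 .$$

So, after $T$ iterations for which $f_t(\mathbf{x}_t) - f_t(\mathbf{x}^*) > \epsilon$ we get

$$0 \le \|\mathbf{x}_{T+1} - \mathbf{x}^*\|^2 \le \|\mathbf{x}_1 - \mathbf{x}^*\|^2 - T\epsilon^2/\kappa^2 .$$

Therefore, we must have

$$T \le \frac{\kappa^2 \|\mathbf{x}_1 - \mathbf{x}^*\|^2}{\epsilon^2} .$$

□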


5.2.2 Proof of Lemma 5.2 

Proof. At each step $t$, the minibatch is being sampled after $\mathbf{x}_t$ and $\mathbf{x}^*$ are fixed. The random
variables $f_t(\mathbf{x}_t)$ (resp. $f_t(\mathbf{x}^*)$) are an average of $b$ i.i.d. random variables whose expectation is
$f(\mathbf{x}_t)$ (resp. $f(\mathbf{x}^*)$). These random variables are bounded, since we assume $\forall t, \mathbf{x}$, $|f_t(\mathbf{x})| \le M$
(see Thm. 5.1). Applying Hoeffding’s bound to the $b$ random samples mentioned above,
together with the union bound over $t \in [T]$, and over both sequences of random variables,
the lemma follows. □
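The Hoeffding argument can be turned into a small calculator for a sufficient minibatch size. The sketch below is this example's own derivation and its constant need not match the one in Lemma 5.2: it assumes every sampled value lies in an interval of width `width` (a hypothetical parameter; `width = 2 * M` if only a bound M on the absolute value is known) and applies a two-sided Hoeffding bound to each of the 2T minibatch averages.

```python
import math

def hoeffding_failure_bound(b, eps, width, T):
    """Union bound over 2T two-sided Hoeffding events for averages of b samples."""
    return 4.0 * T * math.exp(-2.0 * b * eps ** 2 / width ** 2)

def min_batch_size(eps, delta, width, T):
    """Smallest b for which 4T * exp(-2 * b * eps^2 / width^2) <= delta."""
    return math.ceil(width ** 2 * math.log(4.0 * T / delta) / (2.0 * eps ** 2))

b = min_batch_size(eps=0.1, delta=0.05, width=2.0, T=100)
print(b, hoeffding_failure_bound(b, eps=0.1, width=2.0, T=100))
```

The required size scales as log(T/delta)/eps^2, in line with the order of growth discussed in the text, and it does not depend on the dimension.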


5.2.3 Proof of Theorem 5.2 

We will require the following lemma, whose proof is given in App. G. 

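Lemma 5.3 (Absorb probabilities). Let $\{X_t\}_t$ be a Markov chain over states $\{0, 1, 2, \ldots\}$, such
that 0 is an absorbing state, and the transition distribution elsewhere is as follows:

$$X_{t+1} \,|\, \{X_t = i\} = \begin{cases} i - 1 & \text{w.p. } p \\ i + 1 & \text{w.p. } 1 - p \end{cases}$$

Define the absorb probabilities $\alpha_i := P(\exists t > 0 : X_t = 0 \,|\, X_0 = i)$, then:

$$\alpha_i = \left( \frac{p}{1 - p} \right)^i , \quad \forall i \ge 1 .$$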
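As a sanity check on the absorb probabilities, the closed form $(p/(1-p))^i$ can be tested against the first-step recurrence $\alpha_i = p\,\alpha_{i-1} + (1-p)\,\alpha_{i+1}$ with $\alpha_0 = 1$. The sketch below assumes $p < 1/2$ (the regime used later, where $p \le 0.2$); the function name is illustrative.

```python
def absorb_probability(i, p):
    """Closed form (p / (1 - p)) ** i, assuming a down-step probability p < 1/2."""
    return (p / (1.0 - p)) ** i

# First-step analysis: from state i the chain moves to i - 1 w.p. p and to i + 1 w.p. 1 - p.
p = 0.2
for i in range(1, 6):
    lhs = absorb_probability(i, p)
    rhs = p * absorb_probability(i - 1, p) + (1.0 - p) * absorb_probability(i + 1, p)
    print(i, lhs, rhs)   # the two columns agree up to rounding
```

With p = 0.2 the ratio p / (1 - p) is 1/4, so the probability of ever being absorbed decays geometrically in the starting distance i.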

Proof. To prove Theorem 5.2, we construct a counterexample in one dimension. Consider
the following distribution $\mathcal{D}$ over loss functions:

$$f(x) = \begin{cases} -0.5\epsilon x & \text{w.p. } 1 - \epsilon \\ (1 - 0.5\epsilon) \max\{x + 3, 0\} & \text{w.p. } \epsilon \end{cases} \qquad (4)$$

It can be verified that the optimum of $\mathbb{E}_{\mathcal{D}}[f(x)]$ is at $x^* = -3$, and that the slope of the
expected loss in $(-3, \infty)$ is $0.5\epsilon$. Also notice that all points in the segment $[-5, -1]$ are
$\epsilon$-optimal.

Suppose we use SNGD with a batch size of $b = 0.2/\epsilon$, i.e., we sample the gradient $b$ times
at any query point, and then average the samples and take the sign. Assume that at time
$t$ the queried point is greater than $x^* = -3$. Let $Y_t$ be the averaged gradient over the
batch received at time $t$, and define $p = P(Y_t \ge 0)$, i.e., the probability that this sign is
non-negative. Then the following is a lower bound on $1 - p$:

$$1 - p := P(Y_t < 0) \ge (1 - \epsilon)^b = (1 - \epsilon)^{0.2/\epsilon} ,$$

where $(1 - \epsilon)^b$ is the probability that all $b$ samples are negative. Now, consider the function
$G(\epsilon) = (1 - \epsilon)^{0.2/\epsilon}$. It can be shown that $G$ is monotonically decreasing in $[0, 1]$, and that
$G(0.1) \ge 0.8$. Therefore, for any $\epsilon \in [0, 0.1]$ we have $p \le 0.2$.
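The bound on p can be checked numerically. The sketch below assumes the batch size has the form b = c/eps with c = 0.2 (an assumption of this example, chosen to be consistent with the stated bound G(0.1) >= 0.8) and ignores the rounding of b to an integer.

```python
def all_negative_probability(eps, c=0.2):
    """(1 - eps) ** (c / eps): chance that all b = c / eps sampled slopes are negative."""
    return (1.0 - eps) ** (c / eps)

grid = [0.001 * k for k in range(1, 101)]            # eps ranging over (0, 0.1]
values = [all_negative_probability(e) for e in grid]

print(values[0], values[-1])   # decreases from about exp(-0.2) toward 0.9 ** 2 = 0.81
print(min(values) > 0.8)       # True, so p = 1 - (this probability) stays below 0.2
```

Since the probability that every sample in the batch has a negative slope stays above 0.8 on the whole range, the averaged gradient is negative (pushing the iterate away from the optimum) at least 80% of the time.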

Now, let $\{X_t\}_t$ be the random variables describing the queries of SNGD under the
distribution over loss functions given in Equation (4). Also assume that we start SNGD with
$X_1 = 0$, i.e., at a distance of $D = 3$ from the optimum. Then the points that SNGD queries
are on the discrete lattice $\{i\eta\}_{i \in \mathbb{Z}}$ and the following holds:

$$X_{t+1} \,|\, \{X_t = i\eta\} = \begin{cases} (i - 1)\eta & \text{w.p. } p \\ (i + 1)\eta & \text{w.p. } 1 - p \end{cases}$$

Let $i_0 = \lfloor -1/\eta \rfloor$, and note that $|i_0|$ is the minimal number of steps required by SNGD to arrive
from $X_1 = 0$ to an $\epsilon$-optimal solution. Now in order to analyze the probability that SNGD
ever arrives at an $\epsilon$-optimal point, it is sufficient to consider the Markov chain over the lattice
$\{i\eta\}_{i \in \mathbb{Z}}$ with the state $S_0 = i_0 \eta$ as an absorbing state. Using Lemma 5.3 we conclude that
if we start at $X_1 = 0$ then the probability that we ever absorb is:

$$P(\exists t > 0 : X_t \text{ is } \epsilon\text{-optimal} \,|\, X_0 = 0) \le \left( \frac{p}{1 - p} \right)^{|i_0|} \le 4^{-1/\epsilon} \le 4^{-10} ,$$

where we used $p \le 0.2$, a bound of $G = 1$ on the gradients of losses; NGD’s learning rate
$\eta = \epsilon/G$, and $\epsilon \le 0.1$. □



6 Experiments 

A better understanding of how to train deep neural networks is one of the greatest
challenges in current machine learning and optimization. Since learning NN (Neural Network)
architectures essentially requires solving a hard non-convex program, we have decided to
focus our empirical study on this type of task. As a test case, we train a Neural Network
with a single hidden layer of 100 units over the MNIST data set. We use a ReLU activation
function, and minimize the square loss. We employ a regularization over weights with a
parameter of $\lambda = 5 \cdot 10^{-[\ldots]}$.

At first we were interested in comparing the performance of SNGD to MSGD (Minibatch
Stochastic Gradient Descent), and to a stochastic variant of Nesterov’s accelerated gradient
method Sutskever et al. (2013), which is considered to be state-of-the-art. For MSGD and
Nesterov’s method we used a step size rule of the form $\eta_t = \eta_0 (1 + \gamma t)^{-[\ldots]}$, with $\eta_0 = 0.01$
and $\gamma = 10^{-[\ldots]}$. For SNGD we used the constant step size of 0.1. In Nesterov’s method
we used a momentum of 0.95. The comparison appears in Figures 2(a), 2(b). As expected,
MSGD converges relatively slowly. Conversely, the performance of SNGD is comparable
with Nesterov’s method. All methods employed a minibatch size of 100.

Figure 2: Comparison between optimization schemes, plotted against the iteration number.
(a) Left: test error. (b) Middle: objective value (on training set). (c) Right: the objective
of SNGD for different minibatch sizes.

Later, we were interested in examining the effect of minibatch size on the performance
of SNGD. We employed SNGD with different minibatch sizes. As seen in Figure 2(c), the
performance improves significantly with the increase of minibatch size.


7 Discussion 

We have presented the first provable gradient-based algorithm for stochastic quasi-convex
optimization. This is a first attempt at generalizing the well-developed machinery of stochastic
convex optimization to the challenging non-convex problems facing machine learning, and
better characterizing the border between NP-hard non-convex optimization and tractable
cases such as the ones studied herein.

Amongst the numerous challenging questions that remain, we note that there is a gap 
between the upper and lower bound of the minibatch size sufficient for SNGD to provably 
converge. 
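A Equivalence Between Definitions 2.2 and 2.3

First let us show that 2.2 $\Rightarrow$ 2.3.

Proof of 2.2 $\Rightarrow$ 2.3. Let $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$ such that $f(\mathbf{x}), f(\mathbf{y}) \le \lambda$. Let $[\mathbf{x}, \mathbf{y}]$ be the line segment
connecting these points; we need to show that $\forall \mathbf{z} \in [\mathbf{x}, \mathbf{y}]$, $f(\mathbf{z}) \le \lambda$. Assume by contradiction
that there exists $\mathbf{z} \in [\mathbf{x}, \mathbf{y}]$ such that $f(\mathbf{z}) > \lambda$. Assume w.l.o.g. that $\langle \nabla f(\mathbf{z}), \mathbf{x} - \mathbf{y} \rangle \ne 0$
(otherwise we could always find $\mathbf{z}' \in [\mathbf{x}, \mathbf{y}]$ such that $f(\mathbf{z}') = f(\mathbf{z})$ and $\langle \nabla f(\mathbf{z}'), \mathbf{x} - \mathbf{y} \rangle \ne 0$),
and let $\alpha \in (0, 1)$ such that $\mathbf{z} = \alpha \mathbf{x} + (1 - \alpha) \mathbf{y}$. By Definition 2.2 the following applies:

$$0 \le \langle \nabla f(\mathbf{z}), \mathbf{z} - \mathbf{x} \rangle = \langle \nabla f(\mathbf{z}), \alpha \mathbf{x} + (1 - \alpha) \mathbf{y} - \mathbf{x} \rangle = (1 - \alpha) \langle \nabla f(\mathbf{z}), \mathbf{y} - \mathbf{x} \rangle ,$$

$$0 \le \langle \nabla f(\mathbf{z}), \mathbf{z} - \mathbf{y} \rangle = \langle \nabla f(\mathbf{z}), \alpha \mathbf{x} + (1 - \alpha) \mathbf{y} - \mathbf{y} \rangle = -\alpha \langle \nabla f(\mathbf{z}), \mathbf{y} - \mathbf{x} \rangle .$$

Since $\alpha \in (0, 1)$, we conclude that $\langle \nabla f(\mathbf{z}), \mathbf{y} - \mathbf{x} \rangle \ge 0$ and also $\langle \nabla f(\mathbf{z}), \mathbf{y} - \mathbf{x} \rangle \le 0$. This is
a contradiction since we assumed $\langle \nabla f(\mathbf{z}), \mathbf{y} - \mathbf{x} \rangle \ne 0$. □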

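Let us now show that 2.3 $\Rightarrow$ 2.2. 

Proof of 2.3 $\Rightarrow$ 2.2. Consider the 1-dimensional function $h(\alpha) = f(\mathbf{x} + \alpha(\mathbf{y} - \mathbf{x}))$. The 
derivative of $h$ at 0 is $h'(0) = \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle$. Therefore, we need to show that if $\mathbf{y} \in S_f(\mathbf{x})$ 
then $h'(0) \le 0$. By the quasi-convexity assumption, the entire line segment connecting 
$\mathbf{x}$ to $\mathbf{y}$ is in $S_f(\mathbf{x})$. Therefore, for every $\alpha \in [0,1]$ we have $h(\alpha) \le h(0)$. This means that 

\[ h'(0) = \lim_{\alpha \to 0^+} \frac{h(\alpha) - h(0)}{\alpha} \le 0 . \]

□ 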
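
As a numerical illustration of the two equivalent definitions, the following sketch checks both conditions on random pairs of points. The test function $f(\mathbf{x}) = 1 - e^{-\|\mathbf{x}\|^2}$ is an assumption chosen for illustration (it does not appear in the paper); it is quasi-convex because its sublevel sets are balls, yet it is not convex.

```python
import math
import random

def f(x):
    """f(x) = 1 - exp(-||x||^2): quasi-convex (sublevel sets are balls) but not convex."""
    return 1.0 - math.exp(-sum(v * v for v in x))

def grad_f(x):
    scale = 2.0 * math.exp(-sum(v * v for v in x))
    return [scale * v for v in x]

def first_order_violations(pairs, tol=1e-12):
    """Definition 2.2: f(y) <= f(x) should imply <grad f(x), y - x> <= 0."""
    bad = 0
    for x, y in pairs:
        if f(y) <= f(x):
            inner = sum(g * (yj - xj) for g, xj, yj in zip(grad_f(x), x, y))
            if inner > tol:
                bad += 1
    return bad

def sublevel_violations(pairs, steps=10, tol=1e-12):
    """Definition 2.3: the segment [x, y] should stay in the max(f(x), f(y)) sublevel set."""
    bad = 0
    for x, y in pairs:
        level = max(f(x), f(y))
        for k in range(steps + 1):
            a = k / steps
            z = [a * xj + (1.0 - a) * yj for xj, yj in zip(x, y)]
            if f(z) > level + tol:
                bad += 1
                break
    return bad

rng = random.Random(0)
pairs = [([rng.uniform(-3, 3), rng.uniform(-3, 3)],
          [rng.uniform(-3, 3), rng.uniform(-3, 3)]) for _ in range(2000)]
```

Both counters are expected to stay at zero on such a function, while the midpoint of $(2,0)$ and $(4,0)$ already violates the convexity inequality.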


B Local Quasi-convexity of g 

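Here we show that the function $g$ that appears in Equation 2 is SLQC. Denote $\mathbf{x}^* = 
(-10,-10)$, let $\epsilon \in [0,1]$ and let $\mathbf{x}, \mathbf{y}$ be such that $g(\mathbf{x}) - g(\mathbf{x}^*) > \epsilon$ and $\|\mathbf{y} - \mathbf{x}^*\| \le \epsilon$. In 
order to prove SLQC it is sufficient to show that $\|g_{\mathbf{x}}\| > 0$, and that $\langle g_{\mathbf{x}}, \mathbf{x} - \mathbf{y} \rangle \ge 0$ (we denote 
$g_{\mathbf{x}} := \nabla g(\mathbf{x})$). Differentiating $g$ at $\mathbf{x}$ we have: 

\[ g_{\mathbf{x}} = \nabla g(\mathbf{x}) = \left( \frac{e^{-x_1}}{(1 + e^{-x_1})^2}, \frac{e^{-x_2}}{(1 + e^{-x_2})^2} \right) , \]

and it is clear that $\|g_{\mathbf{x}}\| > 0$, $\forall \mathbf{x} \in [-10,10]^2$, thus strictness always holds. We divide the 
proof of $\langle g_{\mathbf{x}}, \mathbf{x} - \mathbf{y} \rangle \ge 0$ into cases: 

Case 1: Suppose that $x_1 \le 0$, $x_2 \le 0$. In this case it is possible to show that the Hessian 
of $g$ is positive-semi-definite, thus $g$ is convex in $[-10,0]^2$. Since $g$ is also 1-Lipschitz, 
this implies that it is $(\epsilon, 1, \mathbf{x}^*)$-SLQC in every $\mathbf{x} \in [-10,0]^2$. 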
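
Case 2: Suppose that at least one of $x_1, x_2$ is positive; w.l.o.g. assume that $x_1 > 0$. In 
this case: 

\[
\begin{aligned}
\langle g_{\mathbf{x}}, \mathbf{x} - \mathbf{y} \rangle
&= \sum_{i=1}^{2} \frac{e^{-x_i}\,(x_i + 10 + (x_i^* - y_i))}{(1 + e^{-x_i})^2} \\
&\ge \sum_{i=1}^{2} \frac{e^{-x_i}\,(x_i + 10 - \epsilon)}{(1 + e^{-x_i})^2} \\
&\ge \frac{19 e^{-10}}{(1 + e^{-10})^2} - \frac{\epsilon e^{-10+\epsilon}}{(1 + e^{-10+\epsilon})^2} \\
&\ge \frac{e^{-10}}{(1 + e^{-10})^2} \left( 19 - \epsilon e^{\epsilon} \right) \\
&> 0 ,
\end{aligned}
\]

where in the second line we used $\|\mathbf{y} - \mathbf{x}^*\| \le \epsilon$. In the third line we used $\epsilon \in [0,1]$, also 
$10 = \arg\min_{z \in [0,10]} \frac{e^{-z}(z + 10 - \epsilon)}{(1 + e^{-z})^2}$, and $\min_{z \in [-10,0]} \frac{e^{-z}(z + 10 - \epsilon)}{(1 + e^{-z})^2} \ge -\frac{\epsilon e^{-10+\epsilon}}{(1 + e^{-10+\epsilon})^2}$. The fourth line uses 
$e^{-10+\epsilon} \ge e^{-10}$, and the last line uses $\epsilon \le 1$. 

The above two cases establish that $g$ is $(\epsilon, 1, \mathbf{x}^*)$-SLQC in every $\mathbf{x} \in [-10,10]^2$, $\epsilon \in (0,1]$. 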
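
The SLQC condition established above can be spot-checked numerically. The sketch below is only a sanity check under stated assumptions: it takes $g$ to be a sum of two sigmoids, which is consistent with the gradient computed in this appendix, and it samples $\mathbf{x}$ and $\mathbf{y}$ at random; the helper names are hypothetical.

```python
import math
import random

X_STAR = (-10.0, -10.0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def g(x):
    """Assumed form of g: a sum of two sigmoids (its gradient matches the one in the text)."""
    return sigmoid(x[0]) + sigmoid(x[1])

def grad_g(x):
    # e^{-z} / (1 + e^{-z})^2 equals sigmoid(z) * (1 - sigmoid(z))
    return [sigmoid(v) * (1.0 - sigmoid(v)) for v in x]

def slqc_check(eps, trials, seed=0):
    """Return (checked, violations) for the condition <grad g(x), x - y> >= 0."""
    rng = random.Random(seed)
    checked = violations = 0
    for _ in range(trials):
        x = [rng.uniform(-10.0, 10.0), rng.uniform(-10.0, 10.0)]
        if g(x) - g(X_STAR) <= eps:
            continue  # the first SLQC alternative already holds at this x
        # Draw y uniformly from the ball of radius eps around x*.
        radius = eps * math.sqrt(rng.random())
        angle = rng.uniform(0.0, 2.0 * math.pi)
        y = [X_STAR[0] + radius * math.cos(angle),
             X_STAR[1] + radius * math.sin(angle)]
        gx = grad_g(x)
        inner = gx[0] * (x[0] - y[0]) + gx[1] * (x[1] - y[1])
        checked += 1
        if inner < 0.0 or max(gx) <= 0.0:
            violations += 1
    return checked, violations

results = {eps: slqc_check(eps, 5000) for eps in (0.1, 0.5, 1.0)}
```

A random check of this kind cannot replace the proof, but a non-zero violation count would indicate an error in the case analysis.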


C Proof of Lemma 3.1 

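Proof. Given $\epsilon \ge 0$, we will show that $\widehat{\mathrm{err}}_m$ is $(\epsilon, e^{W}, \mathbf{w}^*)$-SLQC at every $\mathbf{w} \in B(0, W)$. Recall 
$\phi(z) = (1 + e^{-z})^{-1}$, and consider $\|\mathbf{w}\| \le W$ such that $\widehat{\mathrm{err}}_m(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} (y_i - \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle))^2 > \epsilon$. 
Also let $\mathbf{v}$ be a point $\epsilon/e^{W}$-close to the minimum $\mathbf{w}^*$; we therefore have: 

\[
\begin{aligned}
\langle \nabla \widehat{\mathrm{err}}_m(\mathbf{w}), \mathbf{w} - \mathbf{v} \rangle
&= \frac{2}{m} \sum_{i=1}^{m} \frac{e^{\langle \mathbf{w}, \mathbf{x}_i \rangle}}{(1 + e^{\langle \mathbf{w}, \mathbf{x}_i \rangle})^2} \big( \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) - y_i \big) \big( \langle \mathbf{w}, \mathbf{x}_i \rangle - \langle \mathbf{v}, \mathbf{x}_i \rangle \big) \\
&= \frac{2}{m} \sum_{i=1}^{m} \frac{e^{\langle \mathbf{w}, \mathbf{x}_i \rangle}}{(1 + e^{\langle \mathbf{w}, \mathbf{x}_i \rangle})^2} \big( \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) - \phi(\langle \mathbf{w}^*, \mathbf{x}_i \rangle) \big) \big( \langle \mathbf{w}, \mathbf{x}_i \rangle - \langle \mathbf{w}^*, \mathbf{x}_i \rangle + \langle \mathbf{w}^* - \mathbf{v}, \mathbf{x}_i \rangle \big) \\
&\ge \frac{2}{m} \sum_{i=1}^{m} \frac{4 e^{\langle \mathbf{w}, \mathbf{x}_i \rangle}}{(1 + e^{\langle \mathbf{w}, \mathbf{x}_i \rangle})^2} \big( \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) - \phi(\langle \mathbf{w}^*, \mathbf{x}_i \rangle) \big)^2 - \frac{\epsilon e^{-W}}{2} \\
&\ge 2 e^{-W}\, \widehat{\mathrm{err}}_m(\mathbf{w}) - \frac{\epsilon e^{-W}}{2} \\
&> 0 .
\end{aligned}
\tag{5}
\]

In the second line we used $y_i = \phi(\langle \mathbf{w}^*, \mathbf{x}_i \rangle)$, which holds for the idealized setup. In the third 
line we used the fact that $\phi(z)$ is monotonically increasing and 1/4-Lipschitz, and therefore 
$(\phi(z) - \phi(z'))(z - z') \ge 4(\phi(z) - \phi(z'))^2$. We also used $|\langle \mathbf{w}^* - \mathbf{v}, \mathbf{x}_i \rangle| \le \|\mathbf{w}^* - \mathbf{v}\| \cdot \|\mathbf{x}_i\| \le 
\epsilon e^{-W}$, and $|\phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) - \phi(\langle \mathbf{w}^*, \mathbf{x}_i \rangle)| \le 1$; finally we used $\max_z \frac{e^z}{(1 + e^z)^2} \le \frac{1}{4}$. The fourth line 
uses $\min_{|z| \le W} \frac{e^z}{(1 + e^z)^2} \ge \frac{e^{-W}}{4}$. The last line follows since we assume $\widehat{\mathrm{err}}_m(\mathbf{w}) > \epsilon$. The 
strictness is immediate since $\langle \nabla \widehat{\mathrm{err}}_m(\mathbf{w}), \mathbf{w} - \mathbf{v} \rangle > 0$; therefore, the above establishes SLQC. 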


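We will now show that $\widehat{\mathrm{err}}_m$ is generally not quasi-convex: Consider the idealized 
setup with two samples $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2)$ where $\mathbf{x}_1 = (0, -\log 4)$, $\mathbf{x}_2 = (-\log 4, 0)$ and $y_1 = 
y_2 = 1/5$. The error function is therefore: 

\[ \widehat{\mathrm{err}}_m(\mathbf{w}) = \frac{1}{2} \left( \frac{1}{5} - \frac{1}{1 + e^{-\langle \mathbf{w}, \mathbf{x}_1 \rangle}} \right)^2 + \frac{1}{2} \left( \frac{1}{5} - \frac{1}{1 + e^{-\langle \mathbf{w}, \mathbf{x}_2 \rangle}} \right)^2 , \]

and it can be verified that the optimal predictor is $\mathbf{w}^* = (1,1)$, yielding $\widehat{\mathrm{err}}_m(\mathbf{w}^*) = 0$. 
Now let $\mathbf{w}_1 = (3,1)$, $\mathbf{w}_2 = (1,3)$; it can be shown that $\widehat{\mathrm{err}}_m(\mathbf{w}_1) = \widehat{\mathrm{err}}_m(\mathbf{w}_2) \le 0.018$, yet 
$\widehat{\mathrm{err}}_m(\mathbf{w}_1/2 + \mathbf{w}_2/2) \ge 0.019$. Thus $\widehat{\mathrm{err}}_m$ is not quasi-convex. □ 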
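
The counterexample above is easy to verify numerically. The following sketch evaluates the two-sample error function at $\mathbf{w}^*$, $\mathbf{w}_1$, $\mathbf{w}_2$ and at the midpoint of $\mathbf{w}_1$ and $\mathbf{w}_2$; the variable and function names are chosen for illustration.

```python
import math

def phi(z):
    """Sigmoid activation phi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# The two samples of the counterexample and their labels.
SAMPLES = [(0.0, -math.log(4.0)), (-math.log(4.0), 0.0)]
LABELS = [0.2, 0.2]

def err(w):
    """Average square error over the two samples."""
    total = 0.0
    for x, y in zip(SAMPLES, LABELS):
        total += (y - phi(w[0] * x[0] + w[1] * x[1])) ** 2
    return total / len(SAMPLES)

w_star = (1.0, 1.0)
w1 = (3.0, 1.0)
w2 = (1.0, 3.0)
midpoint = (0.5 * (w1[0] + w2[0]), 0.5 * (w1[1] + w2[1]))

values = {
    "w_star": err(w_star),
    "w1": err(w1),
    "w2": err(w2),
    "midpoint": err(midpoint),
}
```

Since the error at the midpoint exceeds the error at both endpoints, the sublevel set containing $\mathbf{w}_1$ and $\mathbf{w}_2$ is not convex.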


D Proof of Lemma 3.2 

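Proof. Since we are in the noisy idealized setup, and $y_i \in [0,1]$ for all $i$, the following holds 

\[ y_i = \phi(\langle \mathbf{w}^*, \mathbf{x}_i \rangle) + \xi_i , \]

where $\{\xi_i\}_{i=1}^{m}$ are zero mean, independent and bounded random variables, $|\xi_i| \le 1$ for all $i \in [m]$. 
Therefore $\widehat{\mathrm{err}}_m$ can be written as follows: 

\[
\begin{aligned}
\widehat{\mathrm{err}}_m(\mathbf{w}) &= \frac{1}{m} \sum_{i=1}^{m} \big( y_i - \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) \big)^2 \\
&= \frac{1}{m} \sum_{i=1}^{m} \big( \phi(\langle \mathbf{w}^*, \mathbf{x}_i \rangle) - \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) \big)^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i \lambda_i(\mathbf{w}) + c ,
\end{aligned}
\]

where $c = \frac{1}{m} \sum_{i=1}^{m} \xi_i^2$ and $\lambda_i(\mathbf{w}) = 2 \big( \phi(\langle \mathbf{w}^*, \mathbf{x}_i \rangle) - \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) \big)$. We therefore have: 

\[ \widehat{\mathrm{err}}_m(\mathbf{w}) - \widehat{\mathrm{err}}_m(\mathbf{w}^*) = \frac{1}{m} \sum_{i=1}^{m} \big( \phi(\langle \mathbf{w}^*, \mathbf{x}_i \rangle) - \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) \big)^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i \lambda_i(\mathbf{w}) . \]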
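
Now fix $\epsilon > 0$, and let $\mathbf{w}$ be a fixed point in $B(0, W)$ such that $\widehat{\mathrm{err}}_m(\mathbf{w}) - \widehat{\mathrm{err}}_m(\mathbf{w}^*) > \epsilon$. 
Also let $\mathbf{v}$ be a point $\epsilon/e^{W}$-close to the minimum $\mathbf{w}^*$; we therefore have: 

\[
\begin{aligned}
\langle \nabla \widehat{\mathrm{err}}_m(\mathbf{w}), \mathbf{w} - \mathbf{v} \rangle
&= \frac{2}{m} \sum_{i=1}^{m} \frac{e^{\langle \mathbf{w}, \mathbf{x}_i \rangle}}{(1 + e^{\langle \mathbf{w}, \mathbf{x}_i \rangle})^2} \big( \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) - y_i \big) \big( \langle \mathbf{w}, \mathbf{x}_i \rangle - \langle \mathbf{v}, \mathbf{x}_i \rangle \big) \\
&= \frac{2}{m} \sum_{i=1}^{m} \frac{e^{\langle \mathbf{w}, \mathbf{x}_i \rangle}}{(1 + e^{\langle \mathbf{w}, \mathbf{x}_i \rangle})^2} \big( \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) - \phi(\langle \mathbf{w}^*, \mathbf{x}_i \rangle) - \xi_i \big) \big( \langle \mathbf{w}, \mathbf{x}_i \rangle - \langle \mathbf{w}^*, \mathbf{x}_i \rangle + \langle \mathbf{w}^* - \mathbf{v}, \mathbf{x}_i \rangle \big) \\
&\ge \frac{2}{m} \sum_{i=1}^{m} \frac{4 e^{\langle \mathbf{w}, \mathbf{x}_i \rangle}}{(1 + e^{\langle \mathbf{w}, \mathbf{x}_i \rangle})^2} \big( \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) - \phi(\langle \mathbf{w}^*, \mathbf{x}_i \rangle) \big)^2 - \frac{\epsilon e^{-W}}{2} + \frac{1}{m} \sum_{i=1}^{m} \xi_i \tilde{\lambda}_i(\mathbf{w}) \\
&\ge 2 e^{-W} \big( \widehat{\mathrm{err}}_m(\mathbf{w}) - \widehat{\mathrm{err}}_m(\mathbf{w}^*) \big) - \frac{\epsilon e^{-W}}{2} + \frac{1}{m} \sum_{i=1}^{m} \xi_i \bar{\lambda}_i(\mathbf{w}) \\
&\ge \frac{3}{2} \epsilon e^{-W} + \frac{1}{m} \sum_{i=1}^{m} \xi_i \bar{\lambda}_i(\mathbf{w}) ,
\end{aligned}
\tag{6}
\]

where we denote $\tilde{\lambda}_i(\mathbf{w}) = \frac{2 e^{\langle \mathbf{w}, \mathbf{x}_i \rangle}}{(1 + e^{\langle \mathbf{w}, \mathbf{x}_i \rangle})^2} \big( \langle \mathbf{w}^*, \mathbf{x}_i \rangle - \langle \mathbf{w}, \mathbf{x}_i \rangle \big)$, and $\bar{\lambda}_i(\mathbf{w}) = \tilde{\lambda}_i(\mathbf{w}) - 2 e^{-W} \lambda_i(\mathbf{w})$. 
The argumentation justifying the above inequalities is the same as is done for Equation (5) 
(see Appendix C). According to Equation (6), the lemma is established if we can show that 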



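\[ \frac{1}{m} \sum_{i=1}^{m} \xi_i \bar{\lambda}_i(\mathbf{w}) \ge -\epsilon e^{-W} . \]

The $\{\xi_i \bar{\lambda}_i(\mathbf{w})\}_{i=1}^{m}$ are zero mean and independent, and $|\xi_i \bar{\lambda}_i(\mathbf{w})| \le 4(W + 1)$; thus applying 
Hoeffding's bound we get that the above holds with probability at least $1 - \delta$ for 
$m \ge \Omega\!\left( \frac{e^{2W} (W + 1)^2}{\epsilon^2} \log(1/\delta) \right)$. Note that 
in bounding $|\xi_i \bar{\lambda}_i(\mathbf{w})|$, we used $|\xi_i| \le 1$, also $\mathbf{w}, \mathbf{w}^* \in B(0, W)$, and $\max_z \frac{e^z}{(1 + e^z)^2} \le \frac{1}{4}$. □ 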

E Locally-Lipschitz and Strictly Quasi-Convex are SLQC 

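In order to show that a strictly quasi-convex function which is also $(G, \epsilon/G, \mathbf{x}^*)$-Locally-Lipschitz is 
SLQC, we require the following lemma: 

Lemma E.1. Let $\mathbf{z} \in \mathbb{R}^d$ and assume that $f$ is $(G, \epsilon/G, \mathbf{z})$-Locally-Lipschitz. Then, for 
every $\mathbf{x}$ with $f(\mathbf{x}) - f(\mathbf{z}) > \epsilon$ we have $B(\mathbf{z}, \epsilon/G) \subseteq S_f(\mathbf{x})$. 

Proof. Recall the notation $S_f(\mathbf{x}) = \{\mathbf{y} : f(\mathbf{y}) \le f(\mathbf{x})\}$. By Lipschitzness, for every $\mathbf{y} \in 
B(\mathbf{z}, \epsilon/G)$ we have $f(\mathbf{y}) \le f(\mathbf{z}) + \epsilon$. Combining with the assumption that $f(\mathbf{z}) + \epsilon < f(\mathbf{x})$ 
we obtain that $\mathbf{y} \in S_f(\mathbf{x})$. □ 

Therefore, if $f(\mathbf{x}) - f(\mathbf{x}^*) > \epsilon$, then for every $\mathbf{y} \in B(\mathbf{x}^*, \epsilon/G)$ it holds that $f(\mathbf{x}) - f(\mathbf{y}) > 0$, and 
since $f$ is strictly quasi-convex, the latter means that $\langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle \le 0$, and $\|\nabla f(\mathbf{x})\| > 0$. 
Thus $(\epsilon, G, \mathbf{x}^*)$-SLQC is established. 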

F Proof of Theorem 4.2 

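The key lemma that enables us to attain faster rates for smooth functions is the following: 

Lemma F.1. Let $\mathbf{x}^*$ be a global minimum of $f$. Also assume that $f$ is $(\beta, \sqrt{2\epsilon/\beta}, \mathbf{x}^*)$-locally-smooth. 
Then, for every $\mathbf{x}$ with $f(\mathbf{x}) - f(\mathbf{x}^*) > \epsilon$ we have $B(\mathbf{x}^*, \sqrt{2\epsilon/\beta}) \subseteq S_f(\mathbf{x})$. 

Proof. Combining the definition of local-smoothness (Def. 2.1) together with $\nabla f(\mathbf{x}^*) = 0$ we 
get 

\[ |f(\mathbf{y}) - f(\mathbf{x}^*)| \le \frac{\beta}{2} \|\mathbf{y} - \mathbf{x}^*\|^2, \quad \forall \mathbf{y} \in B(\mathbf{x}^*, \sqrt{2\epsilon/\beta}) . \]

Therefore, for every $\mathbf{y} \in B(\mathbf{x}^*, \sqrt{2\epsilon/\beta})$ we have $f(\mathbf{y}) \le f(\mathbf{x}^*) + \epsilon$. Combining with the 
assumption that $f(\mathbf{x}^*) + \epsilon < f(\mathbf{x})$ we obtain that $\mathbf{y} \in S_f(\mathbf{x})$. □ 

The proof of Theorem 4.2 follows the same lines as the proof of Theorem 4.1. The main 
difference is that whenever $f(\mathbf{x}_t) - f(\mathbf{x}^*) > \epsilon$, we use Lemma F.1 and quasi-convexity to 
show that for $\mathbf{y} = \mathbf{x}^* + \sqrt{2\epsilon/\beta}\, \hat{g}_t$ it follows that 

\[ \langle \nabla f(\mathbf{x}_t), \mathbf{y} - \mathbf{x}_t \rangle \le 0 . \]

We therefore omit the details of the proof. 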

G Proof of Lemma 5.3 

Proof. Using the stationarity and Markov property of the chain, we can write the following 
recursive equations for the absorb probabilities: 

\[ \alpha_i = (1 - p)\alpha_{i+1} + p\,\alpha_{i-1}, \quad \forall i > 1 \tag{7} \]

\[ \alpha_1 = (1 - p)\alpha_2 + p \tag{8} \]

Let us guess a solution of the form $\alpha_i = c_0 \rho^i$, where $\rho$ is the decay parameter of the absorb 
probabilities. By inserting this solution into Equation (7) we get an equation for $\rho$: 

\[ (1 - p)\rho^2 - \rho + p = 0 . \]

It can be validated that the only nontrivial solution is $\rho = \frac{p}{1 - p}$; using the latter $\rho$ in 
Equation (8) we get $c_0 = 1$, and therefore we conclude that: 

\[ \alpha_i = \left( \frac{p}{1 - p} \right)^i, \quad \forall i \ge 1 . \]

□ 
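
The closed form obtained in this proof can be checked against the recursion directly. The sketch below plugs $\alpha_i = (p/(1-p))^i$ back into Equations (7) and (8) and evaluates the quadratic at both of its roots; the function names and the value $p = 0.3$ are illustrative assumptions.

```python
def absorb_probability(i, p):
    """Closed form from the proof: alpha_i = (p / (1 - p)) ** i."""
    return (p / (1.0 - p)) ** i

def characteristic(rho, p):
    """Left-hand side of the quadratic obtained from Equation (7)."""
    return (1.0 - p) * rho ** 2 - rho + p

def recursion_residuals(p, max_state):
    """Residuals of Equation (8) (first entry) and of Equation (7) for i = 2..max_state."""
    first = absorb_probability(1, p) - ((1.0 - p) * absorb_probability(2, p) + p)
    residuals = [first]
    for i in range(2, max_state + 1):
        rhs = ((1.0 - p) * absorb_probability(i + 1, p)
               + p * absorb_probability(i - 1, p))
        residuals.append(absorb_probability(i, p) - rhs)
    return residuals

p = 0.3  # illustrative value; any p in (0, 1/2) gives a decaying solution
residuals = recursion_residuals(p, 20)
trivial_root = characteristic(1.0, p)
nontrivial_root = characteristic(p / (1.0 - p), p)
```

All residuals vanish up to floating-point rounding, and the quadratic is zero both at the trivial root $\rho = 1$ and at $\rho = p/(1-p)$.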

H A Broader Notion of Local-Quasi-Convexity 

Definition 3.1 describes a rich family of functions, as depicted in Sections 3.1.1 and 3.1.2. 
However, it is clear that it does not capture piecewise constant and quasi-convex functions, 
such as the zero-one loss, or the Perceptron problem. 

In some cases, e.g. the Perceptron problem, we may have access to a direction oracle, 
$\mathcal{G} : \mathbb{R}^d \mapsto \mathbb{R}^d$. This oracle is a proxy to the gradient, aiming us in a global ascent (descent) 
direction. Following is a broader definition of locally-quasi-convex functions: 

Definition H.1. (Local-Quasi-Convexity) Let $\mathbf{x}, \mathbf{z} \in \mathbb{R}^d$, $\kappa, \epsilon > 0$. Also let $\mathcal{G} : \mathbb{R}^d \mapsto \mathbb{R}^d$. 
We say that $f : \mathbb{R}^d \mapsto \mathbb{R}$ is $(\epsilon, \kappa, \mathbf{z})$-Strictly-Locally-Quasi-Convex (SLQC) in $\mathbf{x}$, with respect 
to the direction oracle $\mathcal{G}$, if at least one of the following applies: 

1. $f(\mathbf{x}) - f(\mathbf{z}) \le \epsilon$. 

2. $\|\mathcal{G}(\mathbf{x})\| > 0$, and for every $\mathbf{y} \in B(\mathbf{z}, \epsilon/\kappa)$ it holds that $\langle \mathcal{G}(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle \le 0$. 

Thus, Definition 3.1 is a special case of the above, which takes the gradient of $f$ to be 
the direction oracle. Note that we can show that NGD/SNGD and their guarantees still 
hold for SLQC functions with a direction oracle. The algorithms and proofs are very similar 
to the ones that appear in the paper, and we therefore omit the details. 

In the following section we illustrate a scenario that fits the above definition. 


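H.1 The $\gamma$-margin Perceptron 

In this setup we have a collection of $m$ samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{m} \in \mathbb{R}^d \times \{0,1\}$, and we are 
guaranteed to have $\mathbf{w}^* \in \mathbb{R}^d$ such that $(2y_i - 1)\langle \mathbf{w}^*, \mathbf{x}_i \rangle \ge \gamma$, $\forall i \in [m]$. Thus, using the sign of 
$\langle \mathbf{w}^*, \mathbf{x}_i \rangle$ as a predictor classifies all the points correctly (with a margin of $\gamma$). 

Letting $\phi$ be the zero-one loss $\phi(z) = \mathbb{1}_{z \ge 0}$, we measure the performance of a predictor 
$\mathbf{w} \in \mathbb{R}^d$ by the average (square) error over all samples, 

\[ \widehat{\mathrm{err}}_m(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} \big( y_i - \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) \big)^2 . \tag{9} \]

Clearly, the gradients of $\widehat{\mathrm{err}}_m(\mathbf{w})$ vanish almost everywhere. Luckily, from the convergence 
analysis of the Perceptron (see e.g. Kalai and Sastry (2009)), we know the following to be a 
direction oracle for $\widehat{\mathrm{err}}_m(\mathbf{w})$: 

\[ \mathcal{G}(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} \big( \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) - y_i \big) \mathbf{x}_i . \tag{10} \]

The next lemma states that in the above setup, the error function is SLQC with respect 
to $\mathcal{G}$. This implies that Algorithm 1 finds an $\epsilon$-optimal minimum of $\widehat{\mathrm{err}}_m(\mathbf{w})$ within poly$(1/\epsilon)$ 
iterations. 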
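
Lemma H.1. Consider the $\gamma$-margin Perceptron problem. Then the error function appearing 
in Equation (9) is $(\epsilon, 2/\gamma, \mathbf{w}^*)$-SLQC in $\mathbf{w}$, $\forall \epsilon \in (0,1)$, $\forall \mathbf{w} \in \mathbb{R}^d$, with respect to the direction 
oracle appearing in Equation (10). 

Proof. Given $\epsilon \in (0,1)$, we will show that $\widehat{\mathrm{err}}_m$ is $(\epsilon, 2/\gamma, \mathbf{w}^*)$-SLQC at every $\mathbf{w} \in \mathbb{R}^d$. 
Consider $\mathbf{w} \in \mathbb{R}^d$ such that $\widehat{\mathrm{err}}_m(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} (y_i - \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle))^2 > \epsilon$. Also let $\mathbf{v}$ be a point 
$\gamma\epsilon/2$-close to the minimum $\mathbf{w}^*$; we therefore have: 

\[
\begin{aligned}
\langle \mathcal{G}(\mathbf{w}), \mathbf{w} - \mathbf{v} \rangle
&= \frac{1}{m} \sum_{i=1}^{m} \big( \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) - y_i \big) \big( \langle \mathbf{w}, \mathbf{x}_i \rangle - \langle \mathbf{v}, \mathbf{x}_i \rangle \big) \\
&= \frac{1}{m} \sum_{i=1}^{m} \big( \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) - \phi(\langle \mathbf{w}^*, \mathbf{x}_i \rangle) \big) \big( \langle \mathbf{w}, \mathbf{x}_i \rangle - \langle \mathbf{w}^*, \mathbf{x}_i \rangle + \langle \mathbf{w}^* - \mathbf{v}, \mathbf{x}_i \rangle \big) \\
&= \frac{1}{m} \sum_{i=1}^{m} \big( \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) - \phi(\langle \mathbf{w}^*, \mathbf{x}_i \rangle) \big) \big( \langle \mathbf{w}, \mathbf{x}_i \rangle - \langle \mathbf{w}^*, \mathbf{x}_i \rangle \big) + \frac{1}{m} \sum_{i=1}^{m} \big( \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) - \phi(\langle \mathbf{w}^*, \mathbf{x}_i \rangle) \big) \langle \mathbf{w}^* - \mathbf{v}, \mathbf{x}_i \rangle \\
&\ge \frac{\gamma}{m} \sum_{i=1}^{m} \big( \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) - \phi(\langle \mathbf{w}^*, \mathbf{x}_i \rangle) \big)^2 - \gamma\epsilon/2 \\
&\ge \gamma\epsilon - \gamma\epsilon/2 \\
&> 0 .
\end{aligned}
\tag{11}
\]

In the second line we used $y_i = \phi(\langle \mathbf{w}^*, \mathbf{x}_i \rangle)$, which holds by our assumption on $\mathbf{w}^*$. In the fourth 
line we used the fact that $\big( \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) - \phi(\langle \mathbf{w}^*, \mathbf{x}_i \rangle) \big) \big( \langle \mathbf{w}, \mathbf{x}_i \rangle - \langle \mathbf{w}^*, \mathbf{x}_i \rangle \big) \ge \gamma \big( \phi(\langle \mathbf{w}, \mathbf{x}_i \rangle) - \phi(\langle \mathbf{w}^*, \mathbf{x}_i \rangle) \big)^2$, 
which holds since $\mathbf{w}^*$ is a minimizer with a $\gamma$-margin. We also used $\widehat{\mathrm{err}}_m(\mathbf{w}) \le 1$, and 
$|\langle \mathbf{w}^* - \mathbf{v}, \mathbf{x}_i \rangle| \le \|\mathbf{w}^* - \mathbf{v}\| \cdot \|\mathbf{x}_i\| \le \gamma\epsilon/2$. Lastly, we use $\widehat{\mathrm{err}}_m(\mathbf{w}) > \epsilon$. 

The strictness is immediate since $\langle \mathcal{G}(\mathbf{w}), \mathbf{w} - \mathbf{v} \rangle > 0$; therefore, the above establishes 
SLQC. □ 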
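
The role of the direction oracle can be illustrated with a small numerical sketch. It assumes labels in $\{0,1\}$ given by $y_i = \phi(\langle \mathbf{w}^*, \mathbf{x}_i \rangle)$, samples in the unit ball, and a margin $|\langle \mathbf{w}^*, \mathbf{x}_i \rangle| \ge \gamma$; the data generator, the helper names and all numeric values are hypothetical. It checks that $\langle \mathcal{G}(\mathbf{w}), \mathbf{w} - \mathbf{w}^* \rangle \ge \gamma\, \widehat{\mathrm{err}}_m(\mathbf{w})$, which follows from the inequality used in the fourth line of the proof.

```python
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(z):
    """Zero-one function: 1 if z >= 0, else 0."""
    return 1.0 if z >= 0.0 else 0.0

def error(w, samples):
    """Average square error, as in Equation (9)."""
    return sum((y - phi(dot(w, x))) ** 2 for x, y in samples) / len(samples)

def oracle(w, samples):
    """Direction oracle, as in Equation (10)."""
    m = len(samples)
    direction = [0.0] * len(w)
    for x, y in samples:
        coeff = phi(dot(w, x)) - y
        for j, xj in enumerate(x):
            direction[j] += coeff * xj / m
    return direction

def make_margin_samples(w_star, gamma, m, seed=0):
    """Hypothetical generator: points in the unit ball with |<w*, x>| >= gamma."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < m:
        x = [rng.uniform(-1.0, 1.0) for _ in w_star]
        score = dot(w_star, x)
        if dot(x, x) <= 1.0 and abs(score) >= gamma:
            samples.append((x, phi(score)))
    return samples

def min_oracle_gap(w_star, gamma, samples, trials, seed=1):
    """Smallest value of <G(w), w - w*> - gamma * err(w) over random w."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        w = [rng.uniform(-2.0, 2.0) for _ in w_star]
        diff = [wj - sj for wj, sj in zip(w, w_star)]
        gap = dot(oracle(w, samples), diff) - gamma * error(w, samples)
        worst = min(worst, gap)
    return worst

W_STAR = [0.6, 0.8]
GAMMA = 0.1
samples = make_margin_samples(W_STAR, GAMMA, 200)
worst_gap = min_oracle_gap(W_STAR, GAMMA, samples, 500)
```

A non-negative gap means that, whenever the error is positive, moving against the oracle direction makes progress towards $\mathbf{w}^*$ even though the gradient of the error itself vanishes almost everywhere.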